Where to Begin? On the Impact of Pre-Training and Initialization in Federated Learning
An oft-cited challenge of federated learning is the presence of
heterogeneity. \emph{Data heterogeneity} refers to the fact that data from
different clients may follow very different distributions. \emph{System
heterogeneity} refers to client devices having different system capabilities. A
considerable number of federated optimization methods address this challenge.
In the literature, empirical evaluations usually start federated training from
random initialization. However, in many practical applications of federated
learning, the server has access to proxy data for the training task that can be
used to pre-train a model before starting federated training. Using four
standard federated learning benchmark datasets, we empirically study the impact
of starting from a pre-trained model in federated learning. Unsurprisingly,
starting from a pre-trained model reduces the training time required to reach a
target error rate and enables the training of more accurate models (up to 40\%)
than is possible when starting from random initialization. Surprisingly, we
also find that starting federated learning from a pre-trained initialization
reduces the effect of both data and system heterogeneity. We recommend future
work proposing and evaluating federated optimization methods to evaluate the
performance when starting from random and pre-trained initializations. This
study raises several questions for further work on understanding the role of
heterogeneity in federated optimization. \footnote{Our code is available at:
\url{https://github.com/facebookresearch/where_to_begin}}
Comment: Accepted at ICLR
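As a concrete illustration of the experimental contrast, the sketch below runs federated averaging (FedAvg) from both a random and a pre-trained initialization on synthetic heterogeneous clients. The linear model, client data, and server-side proxy pre-training are placeholder assumptions, not the paper's benchmarks or training setup.

# Minimal FedAvg sketch contrasting random vs. pre-trained initialization.
# The linear model, synthetic clients, and "proxy" pre-training below are
# illustrative placeholders, not the benchmarks or models used in the paper.
import numpy as np

rng = np.random.default_rng(0)
DIM, CLIENTS, ROUNDS, LOCAL_STEPS, LR = 20, 10, 50, 5, 0.1
w_true = rng.normal(size=DIM)

def make_client(shift):
    # Heterogeneous clients: each sees a shifted input distribution.
    X = rng.normal(loc=shift, size=(100, DIM))
    y = X @ w_true + 0.1 * rng.normal(size=100)
    return X, y

clients = [make_client(shift) for shift in rng.normal(scale=2.0, size=CLIENTS)]

def local_sgd(w, X, y):
    for _ in range(LOCAL_STEPS):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - LR * grad / (1.0 + np.linalg.norm(grad))  # normalized step avoids divergence
    return w

def fedavg(w0):
    w = w0.copy()
    for _ in range(ROUNDS):
        # One round: every client runs local SGD from the shared model,
        # and the server averages the resulting client models.
        w = np.mean([local_sgd(w, X, y) for X, y in clients], axis=0)
    return w

def global_loss(w):
    return float(np.mean([np.mean((X @ w - y) ** 2) for X, y in clients]))

w_random = fedavg(rng.normal(size=DIM))          # random initialization

X_proxy = rng.normal(size=(500, DIM))            # server-side proxy data (assumed)
y_proxy = X_proxy @ w_true + 0.5 * rng.normal(size=500)
w_pre, _, _, _ = np.linalg.lstsq(X_proxy, y_proxy, rcond=None)
w_pretrained = fedavg(w_pre)                     # pre-trained initialization

print("final loss, random init:     ", global_loss(w_random))
print("final loss, pre-trained init:", global_loss(w_pretrained))

In this toy setting the pre-trained start typically begins closer to the shared optimum, mirroring the paper's observation that pre-training reduces the rounds needed to reach a target error rate.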
PaCo: Probability-based Path Confidence Prediction
Coordinated Science Laboratory was formerly known as Control Systems Laboratory. National Science Foundation / CCR-042971. Gigascale Systems Research Center.
Effective Long-Context Scaling of Foundation Models
We present a series of long-context LLMs that support effective context
windows of up to 32,768 tokens. Our model series are built through continual
pretraining from Llama 2 with longer training sequences and on a dataset where
long texts are upsampled. We perform extensive evaluation on language modeling,
synthetic context probing tasks, and a wide range of research benchmarks. On
research benchmarks, our models achieve consistent improvements on most regular
tasks and significant improvements on long-context tasks over Llama 2. Notably,
with a cost-effective instruction tuning procedure that does not require
human-annotated long instruction data, the 70B variant can already surpass
gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks.
Alongside these results, we provide an in-depth analysis on the individual
components of our method. We delve into Llama's position encoding and discuss
its limitations in modeling long dependencies. We also examine the impact of
various design choices in the pretraining process, including the data mix and
the training curriculum of sequence lengths. Our ablation experiments suggest
that having abundant long texts in the pretraining dataset is not the key to
achieving strong performance, and we empirically verify that long-context
continual pretraining is more efficient and similarly effective compared to
pretraining from scratch with long sequences.
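The position-encoding discussion can be made concrete with a short sketch. The abstract does not spell out the modification used, so the code below assumes the commonly used remedy of raising the rotary position embedding (RoPE) base frequency so that rotation frequencies decay more slowly over long contexts; the larger base value and the dimensions are illustrative, not the paper's configuration.

# Minimal sketch of rotary position embeddings (RoPE) with an adjustable base.
# The 500,000 base is an illustrative "long-context" choice, not necessarily
# the value used in the paper; 10,000 is the standard RoPE default.
import numpy as np

def rope_angles(seq_len, head_dim, base=10_000.0):
    # angle[p, i] = p * base^(-2i / head_dim), one angle per dimension pair
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    positions = np.arange(seq_len)
    return np.outer(positions, inv_freq)          # shape (seq_len, head_dim // 2)

def apply_rope(x, base=10_000.0):
    # Rotate query/key vectors x of shape (seq_len, head_dim) by position.
    seq_len, head_dim = x.shape
    ang = rope_angles(seq_len, head_dim, base)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]               # pair up even/odd dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(0).normal(size=(4096, 128))
q_default = apply_rope(q, base=10_000.0)          # standard base
q_long = apply_rope(q, base=500_000.0)            # larger base (illustrative)

A larger base slows the per-dimension rotation, so very distant positions still map to distinct relative angles instead of wrapping around many times within the context window.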
Critical Branches and Lucky Loads in Control-Independence Architectures
148 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2009.
I perform a thorough analysis of the performance sensitivity of CI processors to disambiguation and forwarding. The insights from this analysis are used to drive the design of hardware mechanisms for these two functions that are low in complexity and yet attain high performance. The basic premise behind these mechanisms is to use small caches to perform early disambiguation and forwarding. These caches are not responsible for ensuring correctness; they merely enable high performance in the presence of lucky loads. The caches are backed by a simple load re-execution mechanism that guarantees correctness. I find that the performance of a CI processor with small structures for disambiguation and forwarding (32-entry and 128-entry, respectively) is within 10% of global load and store queues in the worst case.
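To illustrate the division of labor described above, the sketch below models a tiny forwarding cache that supplies early (possibly wrong) load values, with load re-execution against memory acting as the correctness backstop. The cache size, indexing scheme, and flat memory model are hypothetical simplifications, not the thesis's hardware design.

# Minimal software model of a small forwarding cache backed by load
# re-execution. Sizes, direct-mapped indexing, and the flat memory model
# are hypothetical simplifications, not the thesis's hardware design.
class ForwardingCache:
    """Tiny direct-mapped cache used only for early value prediction.
    Aliasing can silently return a value recorded for a different address;
    that is acceptable because every load is later re-executed and verified."""

    def __init__(self, entries=32):
        self.entries = entries
        self.values = [None] * entries

    def record_store(self, address, value):
        self.values[address % self.entries] = value

    def predict_load(self, address):
        return self.values[address % self.entries]

def run(trace, cache_entries=4):
    memory, cache, squashes = {}, ForwardingCache(cache_entries), 0
    for op, address, value in trace:
        if op == "store":
            cache.record_store(address, value)
            memory[address] = value            # architectural state
        else:  # load: execute early from the small cache, then verify
            early = cache.predict_load(address)
            correct = memory.get(address)      # load re-execution (ground truth)
            if early != correct:
                squashes += 1                  # wrong early value: squash and recover
    return squashes

# Hypothetical trace: 0x10 and 0x14 alias in a 4-entry cache, so the load to
# 0x10 receives the value stored to 0x14 early and must be squashed.
trace = [("store", 0x10, 1), ("store", 0x14, 2),
         ("load", 0x10, None), ("load", 0x14, None)]
print("squashes:", run(trace))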